Cluster analysis is one of the main tasks of exploratory data mining and is thus the topic of this week’s exercise. Clustering techniques identify groups (clusters) of observations such that members within a cluster are more similar to each other than to observations in other clusters. Defining what "similar" means, however, often requires considerable contextual knowledge and creativity.
The Boston data set is used. It can be loaded from the R package MASS. According to ?Boston, the data frame has 506 rows (observations) of 14 columns (variables). Briefly, the data report several variables potentially explaining housing values around Boston. Our aim is to classify the suburbs included in the Boston data set into classes based on their characteristics. The variables are:
crim: per capita crime rate by town
zn: proportion of residential land zoned for lots over 25,000 sq.ft.
indus: proportion of non-retail business acres per town
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
nox: nitrogen oxides concentration (parts per 10 million)
rm: average number of rooms per dwelling
age: proportion of owner-occupied units built prior to 1940
dis: weighted mean of distances to five Boston employment centres
rad: index of accessibility to radial highways
tax: full-value property-tax rate per $10,000
ptratio: pupil-teacher ratio by town
black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
lstat: lower status of the population (percent)
medv: median value of owner-occupied homes in $1000s
First, the data are loaded and glimpsed, and thereafter summaries are printed.
# load data
data("Boston")
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...
The descriptive summary includes values of skewness (a measure of the asymmetry of a distribution) and kurtosis (a measure of the tail-heaviness of a distribution), and already shows the quantile cut-points for crime that will be used later.
tab1<-CreateTableOne(vars=c("crim","zn", "indus","chas","nox" ,"rm"
,"age","dis" ,"rad" , "tax" ,"ptratio", "black"
,"lstat" , "medv"), factorVars = c("rad", "chas"),data=Boston)
summary(tab1)
##
## ### Summary of continuous variables ###
##
## strata: Overall
## n miss p.miss mean sd median p25 p75 min max skew
## crim 506 0 0 3.6 8.6 0.3 8e-02 3.7 6e-03 89.0 5.2
## zn 506 0 0 11.4 23.3 0.0 0e+00 12.5 0e+00 100.0 2.2
## indus 506 0 0 11.1 6.9 9.7 5e+00 18.1 5e-01 27.7 0.3
## nox 506 0 0 0.6 0.1 0.5 4e-01 0.6 4e-01 0.9 0.7
## rm 506 0 0 6.3 0.7 6.2 6e+00 6.6 4e+00 8.8 0.4
## age 506 0 0 68.6 28.1 77.5 5e+01 94.1 3e+00 100.0 -0.6
## dis 506 0 0 3.8 2.1 3.2 2e+00 5.2 1e+00 12.1 1.0
## tax 506 0 0 408.2 168.5 330.0 3e+02 666.0 2e+02 711.0 0.7
## ptratio 506 0 0 18.5 2.2 19.1 2e+01 20.2 1e+01 22.0 -0.8
## black 506 0 0 356.7 91.3 391.4 4e+02 396.2 3e-01 396.9 -2.9
## lstat 506 0 0 12.7 7.1 11.4 7e+00 17.0 2e+00 38.0 0.9
## medv 506 0 0 22.5 9.2 21.2 2e+01 25.0 5e+00 50.0 1.1
## kurt
## crim 37.13
## zn 4.03
## indus -1.23
## nox -0.06
## rm 1.89
## age -0.97
## dis 0.49
## tax -1.14
## ptratio -0.29
## black 7.23
## lstat 0.49
## medv 1.50
##
## =======================================================================================
##
## ### Summary of categorical variables ###
##
## strata: Overall
## var n miss p.miss level freq percent cum.percent
## chas 506 0 0.0 0 471 93.1 93.1
## 1 35 6.9 100.0
##
## rad 506 0 0.0 1 20 4.0 4.0
## 2 24 4.7 8.7
## 3 38 7.5 16.2
## 4 110 21.7 37.9
## 5 115 22.7 60.7
## 6 26 5.1 65.8
## 7 17 3.4 69.2
## 8 24 4.7 73.9
## 24 132 26.1 100.0
##
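As a sanity check, the skewness and kurtosis columns above can be reproduced from the moment definitions. A minimal base-R sketch on a toy vector (tableone may apply a slightly different small-sample correction, and `sd()` uses the n-1 denominator, so values match the table only approximately):

```r
# moment-based skewness and excess kurtosis
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}
kurtosis <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^4) - 3  # excess kurtosis: 0 for a normal distribution
}
x <- c(1, 2, 2, 3, 3, 3, 4, 100)  # toy vector with one extreme value
skewness(x)  # strongly positive: the long tail is on the right
kurtosis(x)  # heavy-tailed relative to the normal
```

A symmetric vector (e.g. `1:9`) gives a skewness of zero, which is the reference point for judging the Boston variables.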
To get a better idea of the variables and their distributions some plots are generated.
#density plots for numerical variables
Boston %>%
keep(is.numeric) %>% # keep only numeric columns
gather() %>% # convert to key-value pairs
ggplot(aes(value)) + # plot the values
facet_wrap(~ key, scales = "free") + # in separate panels
geom_density() # as density
#histograms for integer variables
Boston %>%
keep(is.integer) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(bins=10)
Given the variable characteristics (percentages, proportions, or function-based values), it is understandable that most of them have rather uneven, skewed distributions: e.g. black (scaled proportion of black residents), indus (proportion of non-retail business acres), age (proportion of owner-occupied units built prior to 1940), zn (proportion of land zoned for very large lots) and lstat (lower status of the population, percent). In contrast, dwelling size as the average number of rooms (rm) is normally distributed, and the median value of owner-occupied homes (medv) can also be judged approximately normal. The Charles River dummy variable (chas) is binary (1/0), indicating whether the tract bounds the river, and radial highway accessibility (rad) is an interval-scaled index.
#to get a better idea about variable crim it is plotted separately
plot(Boston$crim, col="red",pch=8, main="Crime rate per capita")
text(198,59," Min. 1st Qu. Median Mean 3rd Qu. Max.
0.01 0.08 0.26 3.61 3.68 88.98" )
ggplot(Boston, aes(x = crim)) +
stat_density(aes(col="red"),position="identity",geom="line",size=2)+
ggtitle("Crime rate per capita")+ theme(legend.position="none")
We are especially interested in the variable crim, so it is visualized separately. The crime rate varies widely between areas, ranging from min = 0.01 to max = 88.98. As the plot shows, quite a few high outlier values among the 506 observations pull the mean (3.61) well above the median (0.26). The distribution is strongly skewed to the right.
To explore the relations between the variables of the data set, pairwise scatter plots and a correlation plot are printed.
#scatterplots
pairs(Boston,lower.panel = NULL)
To me these scatter plots are not that informative, though. Thus, I will try another approach: a correlation chart that presents simultaneously the direction (colour) and magnitude (size of the circle) as well as the values of the correlations.
#a more visual correlation matrix
cor_matrix<-cor(Boston) %>% round(2)
corrplot.mixed(cor_matrix,number.cex=0.65,tl.cex=0.6)
And, finally, I create a conservative table of correlations with notation for significance levels, with the help of the following function.
library(xtable)
corstars <-function(x, method=c("pearson", "spearman"), removeTriangle=c("upper", "lower"),
result=c("none", "html", "latex")){
#Compute correlation matrix
require(Hmisc)
x <- as.matrix(x)
correlation_matrix<-rcorr(x, type=method[1])
R <- correlation_matrix$r # Matrix of correlation coefficients
p <- correlation_matrix$P # Matrix of p-values
## Define notation for significance levels; spacing is important.
mystars <- ifelse(p < .0001, "****", ifelse(p < .001, "*** ", ifelse(p < .01, "** ", ifelse(p < .05, "* ", " "))))
## truncate the correlation matrix to two decimals
R <- format(round(cbind(rep(-1.11, ncol(x)), R), 2))[,-1]
## build a new matrix that includes the correlations with their appropriate stars
Rnew <- matrix(paste(R, mystars, sep=""), ncol=ncol(x))
diag(Rnew) <- paste(diag(R), " ", sep="")
rownames(Rnew) <- colnames(x)
colnames(Rnew) <- paste(colnames(x), "", sep="")
## remove upper triangle of correlation matrix
if(removeTriangle[1]=="upper"){
Rnew <- as.matrix(Rnew)
Rnew[upper.tri(Rnew, diag = TRUE)] <- ""
Rnew <- as.data.frame(Rnew)
}
## remove lower triangle of correlation matrix
else if(removeTriangle[1]=="lower"){
Rnew <- as.matrix(Rnew)
Rnew[lower.tri(Rnew, diag = TRUE)] <- ""
Rnew <- as.data.frame(Rnew)
}
## remove last column and return the correlation matrix
Rnew <- cbind(Rnew[1:length(Rnew)-1])
if (result[1]=="none") return(Rnew)
else{
if(result[1]=="html") print(xtable(Rnew), type="html")
else print(xtable(Rnew), type="latex")
}
}
# Make table with centered columns
corstars(Boston,result="html")
| crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| crim | |||||||||||||
| zn | -0.20**** | ||||||||||||
| indus | 0.41**** | -0.53**** | |||||||||||
| chas | -0.06 | -0.04 | 0.06 | ||||||||||
| nox | 0.42**** | -0.52**** | 0.76**** | 0.09* | |||||||||
| rm | -0.22**** | 0.31**** | -0.39**** | 0.09* | -0.30**** | ||||||||
| age | 0.35**** | -0.57**** | 0.64**** | 0.09 | 0.73**** | -0.24**** | |||||||
| dis | -0.38**** | 0.66**** | -0.71**** | -0.10* | -0.77**** | 0.21**** | -0.75**** | ||||||
| rad | 0.63**** | -0.31**** | 0.60**** | -0.01 | 0.61**** | -0.21**** | 0.46**** | -0.49**** | |||||
| tax | 0.58**** | -0.31**** | 0.72**** | -0.04 | 0.67**** | -0.29**** | 0.51**** | -0.53**** | 0.91**** | ||||
| ptratio | 0.29**** | -0.39**** | 0.38**** | -0.12** | 0.19**** | -0.36**** | 0.26**** | -0.23**** | 0.46**** | 0.46**** | |||
| black | -0.39**** | 0.18**** | -0.36**** | 0.05 | -0.38**** | 0.13** | -0.27**** | 0.29**** | -0.44**** | -0.44**** | -0.18**** | ||
| lstat | 0.46**** | -0.41**** | 0.60**** | -0.05 | 0.59**** | -0.61**** | 0.60**** | -0.50**** | 0.49**** | 0.54**** | 0.37**** | -0.37**** | |
| medv | -0.39**** | 0.36**** | -0.48**** | 0.18**** | -0.43**** | 0.70**** | -0.38**** | 0.25**** | -0.38**** | -0.47**** | -0.51**** | 0.33**** | -0.74**** |
From all of the information above it can be seen that there are several relevant correlations between the variables. Strong negative correlations exist between the weighted mean distance to five Boston employment centres (dis) and the proportion of non-retail business acres per town (indus), nitrogen oxide concentration (nox), and the proportion of older properties (age). Not surprisingly, a strong negative correlation between the lower-status share of the population (lstat) and the median home value (medv) is also seen. Strong positive correlations exist especially between the index of accessibility to radial highways (rad) and the full-value property-tax rate per $10,000 (tax), and between rad and indus. The proportion of non-retail business acres per town (indus) is further positively correlated with nitrogen oxide concentration (nox) and the property-tax rate (tax). Furthermore, one of our main interests, the crime rate, is correlated with many of the variables: negatively with, for example, distance to employment centres (dis) and median home value (medv), and positively with, for example, the property-tax rate (tax) and access to radial highways (rad). Thus, an increase in the crime rate seems to be associated with an increasing highway accessibility index and property-tax rate.
Linear discriminant analysis is a method that generates linear combinations of variables to characterize the classes. The method assumes the variables are on a common scale (and, formally, normally distributed within classes), so the data set is standardized: the column means are subtracted from the corresponding columns and the differences are divided by the standard deviations, so that every variable has mean zero and standard deviation one:
\[ scaled(x) = \frac{x-means(x)}{sd(x)} \]
#scale the dataset
boston_scaled <- as.data.frame(scale(Boston))
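To verify what `scale()` does, the formula above can be applied by hand. A minimal sketch on a toy data frame (the same applies column-by-column to Boston; note that `scale()` returns a matrix, hence the `as.numeric()`):

```r
# standardization by hand, equivalent to scale()
manual_scale <- function(df) {
  as.data.frame(lapply(df, function(x) (x - mean(x)) / sd(x)))
}
df <- data.frame(a = c(1, 2, 3, 4), b = c(10, 20, 40, 80))
m  <- manual_scale(df)
colMeans(m)                              # ~0 for every column
sapply(m, sd)                            # 1 for every column
all.equal(m$a, as.numeric(scale(df$a)))  # TRUE
```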
After scaling, all variable values are on a common scale: every mean is zero and every standard deviation is one.
In addition to scaling, a categorical version of the scaled crime rate has to be created. Quantiles are used to yield four categories (low, medium-low, medium-high and high crime rate), i.e. four groups with approximately equal numbers of observations.
Next, the data set is randomly split into a train (80%) and a test (20%) set for the analysis. Thus, the train set has 404 and the test set 102 observations.
# create a quantile vector of crim, and use it to categorize crim
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c('low','med_low','med_high','high'))
# replace the original unscaled variable.
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
# explore the categorised variable.
table(boston_scaled$crime)
##
## low med_low med_high high
## 127 126 126 127
To compare how the original values by the created groups differ between the groups a summary is created stratified by the different crime rate categories. There are many large differences, especially when comparing the high and low crime rate groups.
Boston2<-Boston
Boston2$crime<-boston_scaled$crime
CreateTableOne(vars=c("zn", "indus","chas","nox" ,"rm"
,"age","dis" ,"rad" , "tax" ,"ptratio", "black"
,"lstat" , "medv"), strata=c("crime"), test=FALSE, data=Boston2)
## Stratified by crime
## low med_low med_high
## n 127 126 126
## zn (mean (sd)) 33.85 (34.67) 9.11 (14.54) 2.41 (6.59)
## indus (mean (sd)) 4.88 (4.28) 9.14 (5.80) 12.41 (6.57)
## chas (mean (sd)) 0.04 (0.20) 0.06 (0.24) 0.12 (0.33)
## nox (mean (sd)) 0.45 (0.05) 0.49 (0.06) 0.60 (0.11)
## rm (mean (sd)) 6.60 (0.56) 6.19 (0.47) 6.36 (0.88)
## age (mean (sd)) 43.65 (19.59) 59.03 (29.05) 80.18 (21.91)
## dis (mean (sd)) 5.64 (2.08) 4.54 (1.93) 3.00 (1.25)
## rad (mean (sd)) 3.54 (1.59) 4.78 (1.50) 5.96 (4.28)
## tax (mean (sd)) 283.83 (63.32) 327.83 (102.02) 356.32 (91.50)
## ptratio (mean (sd)) 17.50 (1.89) 18.32 (1.64) 17.84 (2.86)
## black (mean (sd)) 391.25 (9.19) 386.14 (30.87) 364.64 (56.91)
## lstat (mean (sd)) 7.15 (2.79) 11.71 (5.53) 12.74 (6.91)
## medv (mean (sd)) 27.37 (8.00) 22.51 (5.38) 24.10 (10.28)
## Stratified by crime
## high
## n 127
## zn (mean (sd)) 0.00 (0.00)
## indus (mean (sd)) 18.11 (0.13)
## chas (mean (sd)) 0.06 (0.23)
## nox (mean (sd)) 0.68 (0.06)
## rm (mean (sd)) 6.00 (0.69)
## age (mean (sd)) 91.45 (9.95)
## dis (mean (sd)) 2.00 (0.55)
## rad (mean (sd)) 23.85 (1.69)
## tax (mean (sd)) 663.93 (23.34)
## ptratio (mean (sd)) 20.16 (0.49)
## black (mean (sd)) 284.96 (147.79)
## lstat (mean (sd)) 19.00 (6.85)
## medv (mean (sd)) 16.16 (8.63)
A Linear Discriminant Analysis (LDA) model is fitted to classify the suburbs, using the categorized crime rate as the target variable. First, the model is fitted on the training data, and thereafter the classes are predicted on the test data.
set.seed(123)
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- test$crime
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
Based on the fitted model, the results are plotted first plainly, and thereafter as an LDA (bi)plot with the help of a specifically written arrow function that adds arrows for the variable loadings. Keep in mind that for plotting, the classes have to be transformed from categorical to numeric.
#helper function for the biplot arrows.
lda.arrows <- function(x, myscale = 2, arrow_heads = 0.2, color = "deeppink", tex = 1, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
#crime classes to numeric for plotting
classes <- as.numeric(train$crime)
#plotting the lda
p1<-plot(lda.fit, dimen = 2, col = classes, pch = classes)
#(bi)plot
p2<-plot(lda.fit, dimen = 2, col = classes, pch = classes)
#arrows
lda.arrows(lda.fit)
print(lda.fit)
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2500000 0.2475248 0.2549505 0.2475248
##
## Group means:
## zn indus chas nox rm
## low 0.9432005 -0.8959713 -0.11640431 -0.8577158 0.44620905
## med_low -0.1512978 -0.3010310 -0.07547406 -0.5591764 -0.11907536
## med_high -0.4023185 0.2609362 0.18636222 0.4392319 0.04779161
## high -0.4872402 1.0149946 -0.07547406 1.0412268 -0.50541663
## age dis rad tax ptratio
## low -0.8546026 0.8232513 -0.6953230 -0.7314517 -0.45097009
## med_low -0.3467956 0.2952511 -0.5420083 -0.4653999 -0.03350366
## med_high 0.4318874 -0.4060664 -0.4087527 -0.2821208 -0.32611400
## high 0.7952399 -0.8387924 1.6596029 1.5294129 0.80577843
## black lstat medv
## low 0.37835870 -0.76633833 0.51252310
## med_low 0.32163001 -0.18506538 0.01078532
## med_high 0.08166164 0.03210543 0.15583308
## high -0.74993410 0.87136020 -0.65964311
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.08288224 0.74545012 -1.02758413
## indus 0.03371880 -0.27414286 0.12800485
## chas -0.07688231 -0.04569133 0.03313223
## nox 0.24620783 -0.70692830 -1.31233352
## rm -0.13308714 -0.06060304 -0.15792484
## age 0.29339946 -0.28245910 -0.17691125
## dis -0.04642187 -0.28428720 0.07074842
## rad 3.40029809 0.91155778 -0.10269503
## tax 0.04070331 -0.03257583 0.61283757
## ptratio 0.11422801 0.09094945 -0.21802056
## black -0.15410658 0.03229362 0.15457698
## lstat 0.19460392 -0.28843142 0.29476658
## medv 0.17982251 -0.40689877 -0.20349106
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9510 0.0368 0.0122
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class) %>% addmargins()
## predicted
## correct low med_low med_high high Sum
## low 16 10 0 0 26
## med_low 7 15 4 0 26
## med_high 1 6 15 1 23
## high 0 0 1 26 27
## Sum 24 31 20 27 102
By cross-tabulating the correct and predicted classes it can be seen that the model’s predictions are quite accurate for the high category in the test data. On the contrary, the low areas are not recognized as well, and both the medium-low and medium-high classes seem problematic. Relating the results to the plot, the separation is clearest for the highest class, which makes the pattern of prediction accuracies understandable.
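A single overall accuracy figure can also be read off the cross-tabulation: correct predictions sit on the diagonal. A minimal sketch using the counts printed above:

```r
# confusion matrix copied from the cross-tabulation above
cm <- matrix(c(16, 10,  0,  0,
                7, 15,  4,  0,
                1,  6, 15,  1,
                0,  0,  1, 26),
             nrow = 4, byrow = TRUE,
             dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                             predicted = c("low", "med_low", "med_high", "high")))
sum(diag(cm)) / sum(cm)  # overall accuracy: 72/102, about 0.71
diag(cm) / rowSums(cm)   # per-class recall: high is clearly the best
```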
In comparison to LDA, which here used known classes, k-means is an unsupervised clustering method that assigns observations to k clusters by minimizing the within-cluster sum of squared distances to the cluster centroids.
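As a reminder of the mechanics before applying it to Boston, a minimal k-means run on a toy two-group data set (`stats::kmeans`; `nstart` restarts guard against poor random initializations):

```r
set.seed(1)
# two well-separated 2-D groups of 10 points each
toy <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 5), ncol = 2))
km_toy <- kmeans(toy, centers = 2, nstart = 10)
table(km_toy$cluster)  # 10 points per cluster
km_toy$centers         # close to (0, 0) and (5, 5)
```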
For k-means clustering, the Boston data set is rescaled so that the distances are comparable. To examine the distance properties of the data and compare metrics superficially, summaries of both the Euclidean (straight-line) and Manhattan (along-the-axes) distances are printed.
data(Boston)
#center and standardize variables and make it a data frame
boston_scaled<-as.data.frame(scale(Boston))
#Euclidean distance matrix
dist_eu<-dist(boston_scaled)
#for comparison Manhattan distance matrix
dist_man<-dist(boston_scaled,method = "manhattan" )
#summaries
summary(dist_eu)#Euclidean
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
summary(dist_man)#Manhattan
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
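The gap between the two summaries reflects the definitions: for any pair of points the Manhattan distance is at least as large as the Euclidean one. A toy check with a single 3-4-5 pair:

```r
# one point pair: (0, 0) and (3, 4)
p <- rbind(c(0, 0), c(3, 4))
dist(p)                        # Euclidean: sqrt(3^2 + 4^2) = 5
dist(p, method = "manhattan")  # |3| + |4| = 7
```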
The k-means algorithm is first run exploratorily on the data set using 5 clusters.
#kmeans using euclidean and five clusters
km <- kmeans(dist_eu, centers = 5)
pairs(boston_scaled, col = km$cluster,lower.panel = NULL)
At first sight the plotted results look a little like fireworks. Before interpreting the plot in more detail, the optimal number of clusters is estimated using the total within-cluster sum of squares (WCSS) for cluster counts ranging from 1 to 10, and the results are visualized.
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
#Plot the results
plot(1:k_max, twcss, type = 'b')
The most prominent change in the sum of squares happens at 2 clusters. There are further drops after two, but they yield only small improvements. Thus, I carry out the k-means clustering again using 2 as the number of clusters.
km <- kmeans(dist_eu, centers = 2)
pairs(boston_scaled, col = km$cluster, lower.panel = NULL)
The new pairwise scatter plot with only two clusters looks better than the previous one. All data points are assigned to one of the two (red/black) clusters. The clearer the colour separation, the more relevant the variable is for the clustering. Property tax and access to radial highways seem to discriminate quite well between the two clusters.
K-means is performed on the scaled Boston data using 4 clusters. Thereafter, LDA is fitted using the generated clusters as target classes, and a biplot is printed.
set.seed(123)
data(Boston)
boston_scaled4 <- as.data.frame(scale(Boston,center=TRUE,scale = TRUE))
dist_eu4 <- dist(boston_scaled4)
km2 <-kmeans(dist_eu4, centers = 4)
#pairs(boston_scaled4, col = km$cluster, lower.panel = NULL)
boston_scaled4$classes<-km2$cluster
lda.fit2 <- lda(boston_scaled4$classes ~., data = boston_scaled4)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
plot(lda.fit2, dimen = 2, col = as.numeric(boston_scaled4$classes), pch = boston_scaled4$classes)
lda.arrows(lda.fit2, myscale = 2)
The given code is run on the scaled train set. A matrix is created containing the projections of the data points.
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
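The dimensions printed above explain the shape of the product: a 404 x 13 predictor matrix times a 13 x 3 scaling matrix yields 404 observations in 3 discriminant dimensions. A toy sketch of the same conformability rule:

```r
A <- matrix(1:6, nrow = 2)  # 2 x 3
B <- matrix(1:6, ncol = 2)  # 3 x 2
dim(A %*% B)                # 2 x 2: the inner dimensions (3) must match
```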
A 3D plot without colours is created:
library(plotly)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
A 3D plot is created with the categorized crime-rate classes defining the colours:
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=train$crime)
A 3D plot is created with the k-means clusters from the last model defining the colours:
train$cluster<-km2$cluster[match(rownames(train),rownames(boston_scaled4))]
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=as.factor(train$cluster))